Analysis of Video Game Data


The dataset I will be using for this analysis is video game data collected from vgchartz.com. This dataset contains a list of video games with sales greater than 100,000 copies. This data includes:

Rank - Overall rank of the game, determined by total sales made

Name - The name of the video game

Platform - The type of system the game is played on

Year - The year the video game was made

Genre - The genre of game

Publisher - The company that made the game

NA_Sales - Number of sales in North America

EU_Sales - Number of sales in Europe

JP_Sales - Number of sales in Japan

Other_Sales - Number of sales in countries other than the above

Global_Sales - The total number of sales

For this project I wanted to know which factors contributed the most to a best-selling video game and why. The objectives of this final project are to a) test hypothesis through analyzing the data and b) portray results new R package, plotly.

After cleaning our data, we’ll take a look at the statistics of our dataset.

glimpse(cvg)
## Rows: 66,392
## Columns: 9
## $ Rank         <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, ...
## $ Name         <chr> "Wii Sports", "Wii Sports", "Wii Sports", "Wii Sports"...
## $ Platform     <chr> "Wii", "Wii", "Wii", "Wii", "NES", "NES", "NES", "NES"...
## $ Year         <dbl> 2006, 2006, 2006, 2006, 1985, 1985, 1985, 1985, 2008, ...
## $ Genre        <chr> "Sports", "Sports", "Sports", "Sports", "Platform", "P...
## $ Publisher    <chr> "Nintendo", "Nintendo", "Nintendo", "Nintendo", "Ninte...
## $ Global_Sales <dbl> 82.74, 82.74, 82.74, 82.74, 40.24, 40.24, 40.24, 40.24...
## $ Country      <chr> "North America", "Europe", "Japan", "Other", "North Am...
## $ Sales        <dbl> 41.49, 29.02, 3.77, 8.46, 29.08, 3.58, 6.81, 0.77, 15....
summary(cvg)
##       Rank           Name             Platform              Year     
##  Min.   :    1   Length:66392       Length:66392       Min.   :1980  
##  1st Qu.: 4151   Class :character   Class :character   1st Qu.:2003  
##  Median : 8300   Mode  :character   Mode  :character   Median :2007  
##  Mean   : 8301                                         Mean   :2006  
##  3rd Qu.:12450                                         3rd Qu.:2010  
##  Max.   :16600                                         Max.   :2020  
##                                                        NA's   :1084  
##     Genre            Publisher          Global_Sales       Country         
##  Length:66392       Length:66392       Min.   : 0.0100   Length:66392      
##  Class :character   Class :character   1st Qu.: 0.0600   Class :character  
##  Mode  :character   Mode  :character   Median : 0.1700   Mode  :character  
##                                        Mean   : 0.5374                     
##                                        3rd Qu.: 0.4700                     
##                                        Max.   :82.7400                     
##                                                                            
##      Sales        
##  Min.   : 0.0000  
##  1st Qu.: 0.0000  
##  Median : 0.0100  
##  Mean   : 0.1343  
##  3rd Qu.: 0.0900  
##  Max.   :41.4900  
## 
skim(cvg)
Data summary
Name cvg
Number of rows 66392
Number of columns 9
_______________________
Column type frequency:
character 5
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1 1 132 0 11493 0
Platform 0 1 2 4 0 31 0
Genre 0 1 4 12 0 12 0
Publisher 0 1 3 38 0 579 0
Country 0 1 5 13 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Rank 0 1.00 8300.61 4791.75 1.00 4151.00 8300.50 12450.00 16600.00 ▇▇▇▇▇
Year 1084 0.98 2006.41 5.83 1980.00 2003.00 2007.00 2010.00 2020.00 ▁▁▃▇▂
Global_Sales 0 1.00 0.54 1.55 0.01 0.06 0.17 0.47 82.74 ▇▁▁▁▁
Sales 0 1.00 0.13 0.52 0.00 0.00 0.01 0.09 41.49 ▇▁▁▁▁
cvg %>%
  summarize(sum(Global_Sales), mean(Global_Sales))
## # A tibble: 1 x 2
##   `sum(Global_Sales)` `mean(Global_Sales)`
##                 <dbl>                <dbl>
## 1              35682.                0.537
unique(cvg$Platform)
##  [1] "Wii"  "NES"  "GB"   "DS"   "X360" "PS3"  "PS2"  "SNES" "GBA"  "3DS" 
## [11] "PS4"  "N64"  "PS"   "XB"   "PC"   "2600" "PSP"  "XOne" "GC"   "WiiU"
## [21] "GEN"  "DC"   "PSV"  "SAT"  "SCD"  "WS"   "NG"   "TG16" "3DO"  "GG"  
## [31] "PCFX"
unique(cvg$Year)
##  [1] 2006 1985 2008 2009 1996 1989 1984 2005 1999 2007 2010 2013 2004 1990 1988
## [16] 2002 2001 2011 1998 2015 2012 2014 1992 1997 1993 1994 1982 2003 1986 2000
## [31]   NA 1995 2016 1991 1981 1987 1980 1983 2020 2017
unique(cvg$Genre)
##  [1] "Sports"       "Platform"     "Racing"       "Role-Playing" "Puzzle"      
##  [6] "Misc"         "Shooter"      "Simulation"   "Action"       "Fighting"    
## [11] "Adventure"    "Strategy"
unique(cvg$Country)
## [1] "North America" "Europe"        "Japan"         "Other"

By looking at this dataset, it is clear that the rank of the game is based off how well the game did in terms of sales. Sales and Rank are our dependent variables, while Platform, Year, Genre, and Publisher are our independent variables. We want to know which of our independent variables contributes the most to the Global Sales/Rank so that we can make predictions. We will make several graphs examining each variable and its relationship with our dependent variables in order to know what to test.

These two graphs show the correlation between Rank and Global Sales between different countries. The second graph is interactive, courtesy of the Plotly package. By hovering over different coordinates, you can see the exact number of sales in each country , as well as the exact rank.

The positive correlation between Rank and Sales is obvious when looking at this graph. But what the graph also sheds light on is the fact there were the most sales in North America. This could be due to the US being more entertainment-based than Japan or Europe. The population of Japan is smaller than Europe or America so that definitely could be a factor here in this graph. Let’s look at another graph.

This graph shows the most popular genre of video game is Action with 13,264 counts. Sports is a not-so close second place with 9,384 counts. The puzzle genre showed up the least amount compared to the others, with only 2328 counts. These results make sense, as the term “Action” can apply to a large variety of different games. The Puzzle genre seems like it could be quite limited, with less games in production, and this would explain its relatively low count. An explanation for the success of the Sports genre could be that it has games with rights to the NFL which had a huge fanbase already, resulting in a sales increase. Let’s look at another graph.

This graph shows which platforms appear in our data the most. It appears the Nintendo DS and PS2 sold the most compared to the other platforms. The PS2 is already widely regarded as the most successful video game console as it has sold more game copies than any other console. Most of this success is due to the console being the first revolutionary first installment in the world of modern gaming. It also functioned as a DVD player, so its flexible usage appealed to a huge audience.

This graph combines both Platform and Genre. By looking closely at the graph we can see most a large chunk of PS2 sales came from the Sports genre which, as we saw previously, was a popular genre.

This bargraph shows how frequently each year appears on the list. It appears most of the games in the dataset were developed from 2001 to 2011.

This graph shows that the actual sales count was the highest with games made around 1987 to 1989. We can double check the data.

##   sum(sal$Global_Sales) mean(sal$Global_Sales)
## 1                 47.22                  3.148

Our data shows us that 1988 definitely does have a higher sales average than what we calculated the dataset as a whole to have above. I don’t know why this is the case, but hopefully we will unearth some answers using the data further.

This graph is faceted based on the sales in each year in each country. This graph also shows that there were more sales in North America. Like the above graph, North America and Japan show a peak around 1988. Europe shows a slight bump around that time even though we have seen previously that Europe had more sales than Japan. To explain this trend, one explanation could be some games made around that time must’ve been popularized by Japanese and Western culture and sold better.

top <- slice(vg, 1:100)

We have subset the data to include the top 100 ranked games in the datasheet. This will help us make and test predictions for the rest of the dataset.

glimpse(top)
## Rows: 100
## Columns: 11
## $ Rank         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Name         <chr> "Wii Sports", "Super Mario Bros.", "Mario Kart Wii", "...
## $ Platform     <chr> "Wii", "NES", "Wii", "Wii", "GB", "GB", "DS", "Wii", "...
## $ Year         <dbl> 2006, 1985, 2008, 2009, 1996, 1989, 2006, 2006, 2009, ...
## $ Genre        <chr> "Sports", "Platform", "Racing", "Sports", "Role-Playin...
## $ Publisher    <chr> "Nintendo", "Nintendo", "Nintendo", "Nintendo", "Ninte...
## $ NA_Sales     <dbl> 41.49, 29.08, 15.85, 15.75, 11.27, 23.20, 11.38, 14.03...
## $ EU_Sales     <dbl> 29.02, 3.58, 12.88, 11.01, 8.89, 2.26, 9.23, 9.20, 7.0...
## $ JP_Sales     <dbl> 3.77, 6.81, 3.79, 3.28, 10.22, 4.22, 6.50, 2.93, 4.70,...
## $ Other_Sales  <dbl> 8.46, 0.77, 3.31, 2.96, 1.00, 0.58, 2.90, 2.85, 2.26, ...
## $ Global_Sales <dbl> 82.74, 40.24, 35.82, 33.00, 31.37, 30.26, 30.01, 29.02...
summary(top)
##       Rank            Name             Platform              Year     
##  Min.   :  1.00   Length:100         Length:100         Min.   :1982  
##  1st Qu.: 25.75   Class :character   Class :character   1st Qu.:2001  
##  Median : 50.50   Mode  :character   Mode  :character   Median :2007  
##  Mean   : 50.50                                         Mean   :2005  
##  3rd Qu.: 75.25                                         3rd Qu.:2011  
##  Max.   :100.00                                         Max.   :2015  
##     Genre            Publisher            NA_Sales         EU_Sales     
##  Length:100         Length:100         Min.   : 0.980   Min.   : 0.010  
##  Class :character   Class :character   1st Qu.: 3.862   1st Qu.: 2.315  
##  Mode  :character   Mode  :character   Median : 5.675   Median : 3.385  
##                                        Mean   : 7.050   Mean   : 4.207  
##                                        3rd Qu.: 8.290   3rd Qu.: 5.043  
##                                        Max.   :41.490   Max.   :29.020  
##     JP_Sales       Other_Sales      Global_Sales   
##  Min.   : 0.000   Min.   : 0.080   Min.   : 7.340  
##  1st Qu.: 0.235   1st Qu.: 0.550   1st Qu.: 8.975  
##  Median : 1.635   Median : 0.865   Median :11.255  
##  Mean   : 2.046   Mean   : 1.318   Mean   :14.620  
##  3rd Qu.: 3.280   3rd Qu.: 1.620   3rd Qu.:15.925  
##  Max.   :10.220   Max.   :10.570   Max.   :82.740
top %>%
  summarize(sum(Global_Sales), mean(Global_Sales))
##   sum(Global_Sales) mean(Global_Sales)
## 1           1461.98            14.6198
unique(top$Platform)
##  [1] "Wii"  "NES"  "GB"   "DS"   "X360" "PS3"  "PS2"  "SNES" "GBA"  "3DS" 
## [11] "PS4"  "N64"  "PS"   "XB"   "PC"   "2600" "PSP"

The mean of this subset is significantly higher than the dataset as a whole.

By looking at the subsetted graph, it appears more games are concentrated in the 2010 and 2011 years.

The subset makes plotting these variables against each other easier.

After visualizing the data we are ready to make predictions about our data. We will test the relationship of our independent variables with our dependent variable, Global Sales.

mod1 <- aov(data = top,
            formula = Global_Sales ~ Platform + Year + Publisher + Genre)
residuals(mod1)^2 %>% mean() %>% sqrt()
## [1] 7.12078
summary(mod1)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Platform    16   2278   142.4   1.797 0.0512 .
## Year         1    367   367.2   4.634 0.0351 *
## Publisher    8    650    81.3   1.026 0.4259  
## Genre       10   1408   140.8   1.777 0.0831 .
## Residuals   64   5071    79.2                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA table, we have a slightly significant variable, Year. Platform looks like the next influential, with Genre next. Now we can plot our predictions to see how useful the model is.

pred <- add_predictions(top, mod1) 

The model didn’t do too bad in predicting Global Sales. We can try to extend our model to our full dataset.

residuals(mod3)^2 %>% mean() %>% sqrt()
## [1] 1.525477
summary(mod3)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Year            1    224  223.54   95.81 <2e-16 ***
## Platform       30   1497   49.91   21.39 <2e-16 ***
## Genre          11    308   28.04   12.02 <2e-16 ***
## Residuals   16284  37994    2.33                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 271 observations deleted due to missingness
predd <- add_predictions(vg, mod3) 

When it comes to a larger dataset, the model didn’t do as good a job predicting datapoints. The differences can be seen between the first and second set of graphs. So far it appears that Year is the strongest predictor in what makes a best-selling game. However, since our ANOVA table stated each factor was significant (which is not very helpful), it appears more data is needed to fully answer the question.